[rocprofiler-sdk] Optimize HSA queue write interceptor and async signal handler#4276
Conversation
Pull request overview
This PR refactors rocprofiler-sdk’s HSA queue write interception and async signal handling to reduce per-dispatch overhead by batching per-packet state and introducing pooled/batched HSA signals.
Changes:
- Introduces `packet_data_t` and updates completion callbacks to operate on per-packet data rather than session-wide fields.
- Adds a pooled signal infrastructure (pool/pool_object) and rewires queue interception to allocate/reuse signals in batches.
- Adds a new HIP test binary (`hip-graph-bubbles`) intended to create many graph-based kernel dispatches.
Reviewed changes
Copilot reviewed 31 out of 34 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| projects/rocprofiler-sdk/tests/bin/hip-graph-bubbles/hip-graph-bubbles.cpp | New test program that builds/launches a HIP graph repeatedly with roctx ranges. |
| projects/rocprofiler-sdk/tests/bin/hip-graph-bubbles/CMakeLists.txt | Build rules for the new hip-graph-bubbles test binary. |
| projects/rocprofiler-sdk/tests/bin/CMakeLists.txt | Adds hip-graph-bubbles subdirectory to the test build. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/tracing/fwd.hpp | Changes external correlation map type to a small_vector-backed container. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/thread_trace/core.hpp | Updates post_kernel_call signature to take packet_data_t. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/thread_trace/core.cpp | Threads packet_data_t.user_data through post-dispatch data iteration. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/pc_sampling/tests/pc_sampling_internals.hpp | Updates session type name references for completion callback signatures. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/pc_sampling/hsa_adapter.cpp | Adapts to renamed session type and small_vector external correlation map. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/kernel_dispatch/tracing.hpp | Updates dispatch tracing APIs to use queue_info_session_t + packet_data_t. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/kernel_dispatch/tracing.cpp | Moves dispatch callback inputs from session-wide to per-packet storage. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/signal.hpp | Adds signal_t wrapper used by pooled signal objects. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue_info_session.hpp | Introduces packet_data_t and refactors session to hold a small_vector of packet data. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.hpp | Updates async completion callback signature and adds pooled-signal APIs. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/queue.cpp | Core refactor: batching packet data, pooled signals, and async handler changes. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/memory_allocation.cpp | Switches external correlation map alias to the new small_vector-backed type. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/async_copy.cpp | Switches external correlation map alias to the new small_vector-backed type. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/hsa/CMakeLists.txt | Adds signal.hpp to installed/compiled HSA headers list. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/tests/core.cpp | Updates tests for renamed session type and new completed_cb signature. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/sample_processing.hpp | Plumbs packet_data_t into callback processing params. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/sample_processing.cpp | Reads dispatch info/user_data/external corr IDs from packet_data_t. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/dispatch_handlers.hpp | Updates completed callback signature to include packet_data_t. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/dispatch_handlers.cpp | Passes packet_data_t through to sample processing. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/core.cpp | Updates controller callback wiring for new completed callback signature. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/code_object/code_object.cpp | Switches external correlation map alias to the new small_vector-backed type. |
| projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/buffer.cpp | Optimizes get_buffer lookup from linear scan to direct indexing. |
| projects/rocprofiler-sdk/source/lib/common/utility.hpp | Generalizes get_val to work with containers providing find (incl. small_vector pairs). |
| projects/rocprofiler-sdk/source/lib/common/mpl.hpp | Extends pair detection trait to expose first_type / second_type. |
| projects/rocprofiler-sdk/source/lib/common/container/static_vector.hpp | Simplifies emplace_back assignment path. |
| projects/rocprofiler-sdk/source/lib/common/container/stable_vector.hpp | Initializes members to defaults to avoid uninitialized state. |
| projects/rocprofiler-sdk/source/lib/common/container/small_vector.hpp | Adds map-like helpers for small_vector-of-pairs (find, at, emplace). |
| projects/rocprofiler-sdk/source/lib/common/container/record_header_buffer.cpp | Uses memset to clear only the used header range; adds <cstring>. |
| projects/rocprofiler-sdk/source/lib/common/container/pool_object.hpp | New pooled object wrapper with acquire/release semantics. |
| projects/rocprofiler-sdk/source/lib/common/container/pool.hpp | New pool implementation for batched reusable objects (used for signals). |
| projects/rocprofiler-sdk/source/lib/common/container/CMakeLists.txt | Adds new pool headers to the common container header list. |
```cpp
if(auto* pool = get_signal_pool(); use_pool && pool && attribute == 0)
{
    auto& _signal = pool->acquire(construct_hsa_signal, 0, 0, nullptr, attribute);
    ROCP_FATAL_IF(!_signal.in_use()) << "Acquired signal from pool that is not in use";
    *signal = _signal.get().value;
    // ROCP_INFO << fmt::format("acquired signal {} from pool: hsa_signal_t{{.handle={}}}",
    //                          _signal.index(),
    //                          _signal.get().value.handle);
    get_core_table()->hsa_signal_store_screlease_fn(_signal.get().value, 1);
    return &_signal;
}
```
The pooled-signal path calls pool->acquire(construct_hsa_signal, ...), which creates a brand new HSA signal each time the pooled object is acquired. That overwrites the previous hsa_signal_t handle stored in the pool object without destroying it, effectively leaking HSA signals and defeating reuse.
For pooled signals, create the HSA signal once when the pool batch is constructed and only reset its value on reuse (e.g., via hsa_signal_store_*), or explicitly destroy the previous handle before re-creating it.
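The create-once / reset-on-reuse pattern suggested here can be sketched with a plain C++ stand-in (all names are hypothetical: `Handle` plays the role of `hsa_signal_t`, and `create_count` demonstrates that the expensive create path runs once per slot rather than once per acquire):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for hsa_signal_t.
struct Handle
{
    std::uint64_t value = 0;
};

class HandlePool
{
public:
    // Tracks how many times the expensive create path runs.
    static inline std::uint64_t create_count = 0;

    explicit HandlePool(std::size_t batch_size)
    {
        // Expensive creation happens exactly once per slot, at batch
        // construction time (analogous to hsa_signal_create).
        m_handles.resize(batch_size);
        for(auto& h : m_handles)
            h.value = ++create_count;
        for(std::size_t i = 0; i < batch_size; ++i)
            m_free.push_back(i);
    }

    Handle& acquire()
    {
        std::size_t idx = m_free.back();
        m_free.pop_back();
        // Reuse path: the stored handle is returned as-is; a real pool would
        // only reset its value here (analogous to hsa_signal_store_screlease),
        // never re-create it.
        return m_handles.at(idx);
    }

    void release(Handle& h)
    {
        // Return the slot index computed from the contiguous backing storage.
        m_free.push_back(static_cast<std::size_t>(&h - m_handles.data()));
    }

private:
    std::vector<Handle>      m_handles;
    std::vector<std::size_t> m_free;
};
```

With this shape, repeated acquire/release cycles never touch the create path again, which is the reuse behavior the batching design was after.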
```cpp
ROCP_INFO << fmt::format("WriteInterceptor called with pkt_count={}", pkt_count);

using callback_record_t = packet_data_t::callback_record_t;
using packet_vector_t   = common::container::small_vector<rocprofiler_packet, 512>;
```
ROCP_INFO logging inside WriteInterceptor will run on every intercepted queue write and can add significant overhead / log spam in performance-sensitive code paths.
Consider removing this log or demoting it to ROCP_TRACE / VLOG guarded by a debug flag.
```cpp
ROCP_INFO << fmt::format("released signal {}: hsa_signal_t{{.handle={}}}",
                         signal->index(),
                         signal->get().value.handle);
```
Queue::release_signal emits an ROCP_INFO log for every pooled signal release. In the new batching design this can be thousands of logs per interceptor call and may negate the intended performance improvement.
Consider removing this log, lowering it to ROCP_TRACE, or guarding it behind a verbose/debug option.
Suggested change:

```cpp
ROCP_TRACE << fmt::format("released signal {}: hsa_signal_t{{.handle={}}}",
                          signal->index(),
                          signal->get().value.handle);
```
```cpp
    if(get_signal_pool()) get_signal_pool()->report_reuse();
}
```
pool::report_reuse() logs at WARNING level and Queue::sync() calls it unconditionally. This will likely produce warnings during normal operation and can be noisy.
Consider lowering this to ROCP_TRACE/VLOG, or only reporting reuse under an explicit diagnostics flag.
```cpp
        queue_info_session.kernel_pkt.ext_amd_aql_pm4.completion_signal);
}
hsa::get_core_table()->hsa_signal_store_screlease_fn(packet.interrupt_signal, -1);
ROCP_FATAL << "Destroying interrupt signal";
```
ROCP_FATAL << "Destroying interrupt signal" will unconditionally terminate the process whenever an interrupt signal is present. This looks like leftover debug logging and prevents normal cleanup.
Switch this to a non-fatal log (or remove it) so the interrupt signal can be destroyed without aborting.
Suggested change:

```cpp
ROCP_DEBUG << "Destroying interrupt signal";
```
```cpp
template <typename Integral>
constexpr Integral
bit_extract(Integral x, int first, int last)
{
    static_assert(std::is_integral<Integral>::value, "Integral type required");

    auto&& bit_mask = [](int _first, int _last) {
        ROCP_FATAL_IF(!(_last >= _first)) << fmt::format(
            "[queue::bit_extract::bit_mask] -> invalid argument. last (={}) is not >= first (={})",
            _last,
            _first);

        size_t num_bits = _last - _first + 1;
        return ((num_bits >= sizeof(Integral) * 8) ? ~Integral{0}
                                                   /* num_bits exceed the size of Integral */
                                                   : ((Integral{1} << num_bits) - 1))
               << _first;
    };

    return (x >> first) & bit_mask(0, last - first);
}
```
bit_extract is declared constexpr but its body calls ROCP_FATAL_IF / fmt::format via the bit_mask lambda. That makes the function non-constexpr-suitable and can cause compilation errors.
Either drop the constexpr qualifier here, or rework this into a truly constexpr-safe implementation without runtime logging.
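One way to rework this — a sketch, not the PR's code — is to keep the mask computation pure so the function stays usable in constant expressions, and leave argument validation to the caller or a non-constexpr logging wrapper (`bit_mask_ce` / `bit_extract_ce` are hypothetical names):

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Pure constexpr mask: bits [first, last] inclusive, with the same
// "mask wider than the type saturates to all-ones" behavior as the lambda.
template <typename Integral>
constexpr Integral
bit_mask_ce(int first, int last)
{
    static_assert(std::is_integral<Integral>::value, "Integral type required");
    const std::size_t num_bits = static_cast<std::size_t>(last - first + 1);
    return ((num_bits >= sizeof(Integral) * 8)
                ? ~Integral{0}
                : static_cast<Integral>((Integral{1} << num_bits) - 1))
           << first;
}

// Fully constexpr-suitable: no logging, no fmt::format in the body.
template <typename Integral>
constexpr Integral
bit_extract_ce(Integral x, int first, int last)
{
    return (x >> first) & bit_mask_ce<Integral>(0, last - first);
}

// Usable in constant expressions, which the logging version is not:
static_assert(bit_extract_ce(0b110100u, 2, 4) == 0b101u, "bits [2,4] of 0b110100");
```

The `ROCP_FATAL_IF` precondition check would then live in a thin non-constexpr wrapper used on runtime paths.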
```cpp
bool expected = true;
auto val = m_in_use.compare_exchange_strong(expected, false);

if(m_pool) m_pool->release(m_index);
```
pool_object::release() calls m_pool->release(m_index) even when the compare_exchange_strong fails (i.e., the object wasn't actually in use). That can push the same index into the available queue multiple times and lead to multiple threads acquiring the same pooled object.
Only return the index to the pool when the state transition from in-use -> free succeeds.
Suggested change:

```cpp
if(val && m_pool) m_pool->release(m_index);
```
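The fix can be sketched with standard atomics (`Slot` / `SlotPool` are hypothetical stand-ins for `pool_object` / `pool`): only the thread that wins the in-use → free transition returns the index, so a double release cannot push the same slot twice.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

struct SlotPool
{
    std::vector<std::size_t> available;
    void release(std::size_t idx) { available.push_back(idx); }
};

struct Slot
{
    std::atomic<bool> m_in_use{true};
    std::size_t       m_index = 0;
    SlotPool*         m_pool  = nullptr;

    void release()
    {
        bool expected = true;
        // compare_exchange_strong returns true only for the one thread that
        // flips in-use -> free; losers (including redundant second calls)
        // must not touch the free list.
        if(m_in_use.compare_exchange_strong(expected, false) && m_pool)
            m_pool->release(m_index);
    }
};
```

Without the `val` guard, the second `release()` call would enqueue `m_index` again and two threads could later acquire the same pooled object.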
```cpp
auto  _read_lk = std::shared_lock<std::shared_mutex>{m_available_mtx};
auto& _obj     = m_pool.at(_idx.value());
```
pool::acquire() reads from m_pool without taking m_pool_mtx (it only holds m_available_mtx). Another thread can enter the "add a new batch" path and mutate m_pool under m_pool_mtx, causing a data race / invalid reference.
Take a shared lock on m_pool_mtx when accessing m_pool (and similarly in release() when calling m_pool.at(idx)), so reads and batch growth are properly synchronized.
Suggested change:

```cpp
auto _read_lk = std::shared_lock<std::shared_mutex>{m_available_mtx};
// Protect read access to m_pool with a shared lock on m_pool_mtx
auto _pool_read_lk = std::shared_lock<std::shared_mutex>{m_pool_mtx};
auto& _obj = m_pool.at(_idx.value());
```
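The locking discipline suggested here can be sketched generically (`GrowablePool` and its members are hypothetical): every read of the backing container takes a shared lock on the same mutex that batch growth takes exclusively, so handed-out references cannot race with container mutation.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <shared_mutex>

class GrowablePool
{
public:
    // Reader side: shared lock on the pool mutex while indexing the container.
    int& at(std::size_t idx)
    {
        std::shared_lock<std::shared_mutex> lk{m_pool_mtx};
        return m_pool.at(idx);
    }

    // Writer side ("add a new batch" path): exclusive lock while growing.
    std::size_t grow(std::size_t batch)
    {
        std::unique_lock<std::shared_mutex> lk{m_pool_mtx};
        for(std::size_t i = 0; i < batch; ++i)
            m_pool.push_back(0);
        return m_pool.size();
    }

private:
    // std::deque keeps references stable across growth, which is what makes
    // handing out references from at() viable at all; the lock only has to
    // protect the container bookkeeping, not the element storage.
    std::deque<int>   m_pool;
    std::shared_mutex m_pool_mtx;
};
```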
```cpp
_core_api.hsa_signal_wait_relaxed_fn(_active_kernels,
                                     HSA_SIGNAL_CONDITION_EQ,
                                     0,
                                     timeout_hint.count(),
                                     HSA_WAIT_STATE_BLOCKED);
```
Queue::sync() now waits with a 1s timeout and ignores the return value from hsa_signal_wait_relaxed_fn. If the wait times out, sync() will return while kernels are still active, and the destructor may proceed to destroy resources prematurely.
Use an infinite/very large timeout (as before) or loop until the signal reaches the expected value, and handle the timeout/error case explicitly.
Suggested change:

```cpp
while(true)
{
    const auto current_value = _core_api.hsa_signal_wait_relaxed_fn(
        _active_kernels,
        HSA_SIGNAL_CONDITION_EQ,
        0,
        timeout_hint.count(),
        HSA_WAIT_STATE_BLOCKED);
    // hsa_signal_wait_relaxed_fn returns the signal value. If it is zero,
    // the condition HSA_SIGNAL_CONDITION_EQ 0 has been satisfied and all
    // active kernels have completed.
    if(current_value == 0)
    {
        break;
    }
    // Timeout or early wakeup occurred while kernels are still active.
    // Log and continue waiting to avoid destroying resources prematurely.
    ROCP_WARNING << fmt::format(
        "Timeout while waiting for queue sync: {} kernels still active", current_value);
}
```
I pulled this PR into a clean local worktree and tried it against the same HIP graph kernel-trace repro cases we have been using for the queue/signal issue.

One build caveat first: on my ROCm 7.13 / TheRock venv, the PR head (

With that single compatibility patch added, I still could not get the PR branch to pass the HIP graph repro:
I put the exact compatibility patch and the two crash logs into a secret gist here: https://gist.github.com/powderluv/f65f4560fe338effd090fd7dd57d833d Files in the gist:
So at least on this setup, this alternative implementation is not yet passing the existing HIP graph test cases.
I pulled this into a clean workspace and iterated on top of the PR head locally. The updated branch is here:
Local commit stack on top of the PR branch:
What changed at a high level:
Validation on the HIP graph reproducer (
This is materially different from the original state I tested earlier on this machine, where the branch either failed to build on the ROCm 7.13 venv or segfaulted / failed to emit profiler output on the same HIP graph kernel-trace cases.
I added a local hotspot pass on the current Method:
The main result is that the async completion callback is not the dominant performance hotspot. At the first summary window:
Interpretation:
The next useful step is finer instrumentation inside
So the current evidence says: optimize enqueue-side setup first, not async callback execution.
Follow-up hotspot note from a second local instrumentation pass on the HIP graph repro. I split the enqueue-side
Takeaway:
One nuance: the raw
Follow-up after cleaning up the local diff and updating the comparison branch. I pushed a cleaned queue-only commit on top of
What changed in this cleanup:
Validated from a clean rebuild/stage in the venv-backed environment at:
Wide HIP graph kernel-trace reruns on the cleaned stage:
One caveat: the first fresh
That fault did not reproduce on the immediate rerun above, and the second wide case also passed. So the ready-queue throughput fix is on the branch now, but there is still some residual instability outside the queue ready-queue path that may need a separate follow-up.
Motivation
Rewrites how rocprofiler-sdk handles signal creation and async signal handlers in queue interception.
Technical Details
JIRA ID
Test Plan
Test Result
Submission Checklist